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Abstract: The compound decision problem for a vector of independent 
Poisson random variables with possibly different means has half a century 
old solution. However, it appears that the classical solution needs smooth- 
ing adjustment even when there are many observations and relatively small 
means such that the empirical distribution is close to its mean. We dis- 
cuss three such adjustments. We also present another approach that first 
transforms the problem into the normal compound decision problem. 

1. Introduction 

In this paper we consider the problem of estimating a vector A = (Ai, . . . , A„), 
based on observations Yi,...,Y n , where Yi ~ Po(Xi) are independent. The 
performance of an estimator A is evaluated based on the risk 

£a||A-A|| 2 , (1) 

which corresponds to the loss function 

L 2 (A,A) = ^(A,-A i ) 2 . 

Empirical Bayes (EB) is a general approach to handle compound decision 
problems. It was suggested by Robbins, see (1951, 1955); see Copas (1969) and 
Zhang (2003) for review papers. Suppose we assume that A^, i = 1, . . . ,n are 
realizations of i.i.d. Aj, i = 1, . . . , n, where A, ~ G, then a natural approach is 
to use the Bayes procedure: 

5 G = argmin E G (S(Y) - A) 2 , (2) 
s 
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and estimate A by A = (S . . . , S G (Y n ). When G is completely unknown, 

but it is assumed that Ai, . . . , A„ are i.i.d., then it may be possible to estimate 
S G from the data Yi, . . . ,Y n , and replace it by some S G . Optimal frequentist 
properties of S G in the context of the compound decision problem, are described 
in terms of optimality within the class of simple symmetric decision functions. 
See the recent paper by Brown and Greenshtein (2009) for a review of the 
topic. The optimality of the empirical Bayes decision within the larger class of 
pcrmutational invariant decision functions is shown in Greenshtein and Ritov 
(2009). 

The Bayes procedure 5 has an especially simple form in the Poisson setup. 
In this case there is also a simple and straightforward estimator S G for S G . 
Denote by P the joint distribution of (A, Y), which is induced by G. The Bayes 
estimator of Ai given an observation Yi = y, is: 



kg, s- sv* iv _ ,_JXP(Y i =y\A i = X)dG(X) 



5 G {y) = E{K i \Y i = y) = 



JP(Y i = y\A i = X)dG(X) 
(y + l)Py(y + l) 

Py{y) 



where Py is the marginal distribution of Y under P. Given Y\, . . . , Y n , we may 
estimate Py(y) trivially by the empirical distribution: Py(y) = #{i\Yi = y}/n. 
We obtain the following Empirical Bayes procedure 

= <» + f^ + 1 > . (4) 
Py{v) 

This is currently the "default" / "classical" empirical Bayes estimator in the 
Poisson setup, suggested initially by Robbins (1955). Various theoretical results 
established in the above mentioned papers and many other papers, imply that 
as n — > oo, the above procedure will have various optimal properties. This is 
very plausible, since as n — > oo, Py — > Py and thus S G — > 8 G . However, the 
convergence may be very slow, even in common situationsn as demonstrated in 
the following example, and one might want to improve the above S G . This is 
the main purpose of this work. 

Example 1: Consider the case where n = 500 and A; = 10, i = 1, . . . , 500. 
The Bayes risk of 8 for a distribution/prior G with all its mass concentrated 
at 10 is, of course, 0. The risk of the naive procedure which estimates Ai by Yi, 
equals the sum of the variances, that is, 10 x 500 = 5000. In 100 simulations we 
obtained an average loss of 4335 for the procedure (4), which is not a compelling 
improvement over the naive procedure, and very far from the Bayes risk. 

We will improve 5 mainly through "smoothing" . A non-trivial improvement 
is obtained by imposing monotonicity on the estimated decision function. By 
imposing monotonicity without any further step, the average total loss in the 
above example in 100 simulations is reduced to 301; by choosing a suitable 
smoothing parameter (h = 3, see Section 2 below) and imposing monotonicity, 
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the average loss is reduced further to 30. Early attempts to improve (4) through 
smoothing, including imposing monotonicity, may be found in Maritz (1969) 
and references there. 

The rest of the paper is organized as follows. In Section 2 we will suggest 
adjustments and improvements of 8 G . In Section 3 we describe the alternative 
approach of transforming the Poisson EB problem to a normal EB problem, us- 
ing a variance stabilizing transformation. In Section 4 we discuss some decision- 
theoretic background, and in particular we examine loss functions other than 
the squared loss. In Section 5 we discuss the above mentioned two approaches 
and compare them in a simulation study. Both approaches involve a choice of 
a "smoothing-parameter" . A choice based on cross-validation is suggested in 
Section 6. In Section 7 we present an analysis of real data, describing car acci- 
dents. Finally, in Section 8 we briefly describe a further third approach, which 
estimates 5 using a nonparametric MLE. 

2. Adjusting the classical Poisson empirical Bayes estimator 

In the Introduction we introduced the Bayes decision function 6 G and its straight- 
forward estimator 5 G . Surprisingly, it was found empirically that even for n 
relatively large, when the empirical distribution is close to its expectation, the 
estimated decision function should be smoothed. We discuss in this section how 
this estimator can be improved. The improvement involves three steps, which 
finally define an adjusted Robbins estimator. 

2.1. Step 1 

Recall the joint probability space defined on (Y, A). We introduce a r.v. N ~ 
Po(h), where N is independent of Y and A. Let Z = Y + Af. Consider the 
suboptimal decision function 

S hA (z) = E(A\Z = z) = E(A + h\Z = z) - h. (5) 

The above is the optimal decision rule, when obtaining the corrupted obser- 
vations Zi = Yi + Ni, i = 1, . . . , n instead of the observations Y± , . . . , Y n . The 
"corruption parameter" h is a selected parameter, referred to as "smoothing pa- 
rameter" . In general, we will select smaller h as n becomes larger. See Section 6 
for further discussion on the choice of h. Motivated by (5) and reasoning similar 
to (4), we define 6h,i as: 

k,{ Z )= {Z + l)Pz{Z + l) ~h. (6) 
' U ^ Pz(z) 

where the distribution Pz(z) is defined by 

Pz{z) = J2 Mi) x exp(-^) 7 ^— . (7) 
i=0 ^ '' 
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Note that Pz{z) as defined in (7) involves observation of Y through the quan- 
tity Py(y) that appears inside its definition. It is — in general — a much better 
estimate of Pz(z) than is the empirical distribution function #{i : Z% = z}. 

2.2. Step 2. 

There is room for considerable improvement of Sh.i- Note that Sh.i is applied 
to the randomized observation Zi. Therefore, the natural next adjustment is 
Rao-Blackwcllization of the estimator. Define 

h,2{y)=E H {8 hA {y+N)), (8) 

for j\f ~ Po(h), which is independent of the observations Yj, i = 1, . . . ,n. That 
is, 

°° hP 

Note that for a given y, the value of Sh,2(y) depends on all of Py(0), Py(l), ■ . ■ , 
although mainly on the values in the neighborhood of y. 

2.3. Step 3 

Finally after applying adjustments 1 and 2 we obtain a decision function which 
is not necessarily monotone. However, because of the monotone likelihood ratio 
property of the Poisson model, S G is monotone. A final adjustment is to impose 
monotonicity on the decision function Sh,2- We do it through applying isotonic 
regression by the pulling adjacent violators, cf. Robertson, Wright, and Dyk- 
stra (1988). Note, the monotonicity is imposed on 8h,2 confined to the domain 
D(Y) = {y : Yi = y for some i = 1, . . . , n}. To be more explicit, an estimator 
is isotonic if 

y h y 3 G D(Y) and Vi < Vj <% ; ) < 6{ Vj ), (9) 
and 5h,3 is isotonic and satisfies 

n n 

y^(4,3(yQ - ShAVi)) 2 = min{y^(^(y») - '8 h . 2 {yi)) 2 : 5 satisfies (9)|. 

i=l i=l 

We obtain the final decision function 5h,3, after this third step. 
In order to simplify notations we denote: = ^,3. This is our adjusted 
Robbins estimator. 

Finally we remark on a curious discontinuity property of A^,. The function A/j 
is a random function, which depends on the realization y = (z/i, . . . , y n ). In order 
to emphasize it we write here = A^. Consider the collection of functions 
parameterized by h, denoted {A y ^(y)}. It is evident from the definition of (6), 
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that Ay t h{y) does not (necessarily) converge to A y fl(y) as h approaches 0, even 
for y in the range y\, . . . ,y n . This will happen whenever there is a gap in the 
range of y. Suppose, for simplicity that Py{y) = 0, while Py(y — 1), Py(y + 1) > 
O.Then, lim/^o 4,1(2/-!) = 0, and lim/^o hS hA (y) = (y + l)P Y (y + l)/ P Y (y- 
1). Hence 

lim S h . 2 (y - 1) = lim E[ 5 h ,i(y - 1 + N)\yi, . . . ,y n ) 

h— >0 h— s-0 \ / 

= lim ((1 - h)S hA (y - 1) + W M (y)) 

which is strictly different from ^0,2(2/) = 0. Suppose that P y (y) > and Py(y + 
jo) > for some jo > 1, but P(y + j) = for j = 1, . . . , j ~ 1- Then one can 
check directly from the definition that lim/j_>o 5h,2 = y + io- Note that in such 
a situation 5 G (y) = 0. Hence ^,2(2/) for small to moderate ft. seems preferable 
to 5 (y) = ^0,2(2/) in such gap situations. 

This phenomena is reflected in our simulations, Section 5, especially in Table 

5. 

Another curious feature of our estimator is when applied on y m ax = max{Yi, Y n }. 
It may be checked that: 5h,2{ymax) = {Umax + + 0(h 2 ). When h is small so 
that (y m ax + <^ Vmax, this would introduce a significant bias. Hence, choos- 
ing very small h, might be problematic, though this bias is partially corrected 
through the isotonic regression. An approach to deal with this curiosity could 
be to treat y max separately, for example decide on the value of the decision 
function at the point y m ax through cross-validation as in Section 6, trying a few 
plausible values. 

3. Transforming the data to normality. 

The emprical Bayes approach for the analogous normal problem has also been 
studied for a long time. See the recent papers of Brown and Greenshtein (2009) 
and of Wenhua and Zhang (2009) and references there. The Poisson problem and 
the derivation of (4) are simpler and were obtained by Robbins at a very early 
stage, before the problem of density estimation, used in the normal empirical 
Bayes procedure, was addressed. In what follows we will consider the obvious 
modification of the normal method to the Poisson problem. 

In the normal problem we observe Zi ~ N(Mi,a 2 ), i = 1, ...,n where 
Mi,...,M n are i.i.d. random variables sampled from G and the purpose is 
to estimate /Lti, . . . , jjb n the realizations of M 1; . . . , M n . The application of the 
normal EB procedure has a few simple steps. First we transform the Poisson 
variables Y\,...,Y n to the variables Z; = 2 * y/Yi + q. Various recommenations 
for q are given in the literature, the simpler and most common one is q = 0, 
the choice q = 0.25 was recommended by Brown et. al. (2005, 2009). We now 
treat Zi's as (approximate) normal variables with variance a 1 = 1 and mean 
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2 * vAi, and estimate their means by pn, through applying normal empirical 
Bayes technique; specifically, fii = <5at,/i(Zj), as defined in (11) below. Finally 
we estimate \ = EY i: by Aj = j/if . 

We will follow the approach of Brown and Greenshtein (2009). Let 

It may be shown that the normal Bayes procedure denoted 6^, satisfies: 

8%(z)=z + * 29 -^. (10) 

The procedure studied in Greenshtein and Brown (2009), involves an estimation 
of 6%, by replacing g and g' in (10) by their kernel estimators which are derived 
through a normal kernel with bandwidth h. Denoting the kernel estimates by 
§h and g' h we obtain the decision function, (Z±, . . . , Z n ) x z i— > R: 

6 N , h (z) = z + a 2 ^. (11) 
9h(z) 

One might expect this approach to work well in setups where Xi are large, 
and hence, the normal approximation to = y/Yi + q is good. In extensive 
simulations the above approach was found to also work well for configurations 
with moderate and small values of A. In many cases it was comparable to the 
adjusted Poisson EB procedure. 

Remark In the paper of Brown and Greenshtein the estimator S^.h as defined 
in (11) was studied. However, just as in the Poisson case, it is natural to impose 
monotonicity. In the simulations of this paper we are making this adjustment 
using isotonic regression. Again, the monotonicity is imposed on (5jv,ft confined 
to the range {yi, y n }- We denote the adjusted estimator by 

4. The loss functions. 

The estimator 5]\r,h{Zi) = fit, may be interpreted as an approximation of the 
nonparametric EB estimator for fii = 2y / Ai, based on the (transformed) obser- 
vations Zi under the loss L(fi, a) = \\fi — a|| 2 , for the decision a — (oi, . . . , a n ). 
Thus, jfif may be interpreted as the approximation of the empirical Bayes 
estimator for A;, under the loss 

L H {\a) = ^{VXi-^f = -21og(l - D 2 H ), 

where Dh is to the Hcllinger distance between the distributions n^ > °(^i) an d 

Y[Po{ ai ). 
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Some papers that discuss the problem of estimating a vector of Poisson means 
are Clevenson and Zidek (1975), Johnstone (1984), Johnstone and Lalley (1984). 
Those and other works suggest that a particularly natural loss function in ad- 
dition to Lh and Li, denoted Lkl is 



Lkl(W) = 



(Aj - A,;) 



2 



^ A, : 



Note, Lxl also corresponds to the local Kulback-Leibler distance between the 
distributions. 

The above articles and some other literature, strongly suggest that Lkl is 
the 'most natural' general loss function. From an empirical Bayes perspective, 
the optimal decisions that correspond to those three loss functions may have 
more and less similarity, depending on the configuration. For example, when the 
prior G is concentrated on a point mass, the Bayes procedures corresponding to 
those 3 loss functions are obviously the same. Since the Lkl loss is of a special 
importance, we will briefly describe how our analysis can be modified to handle 
it. As in the case of Li loss, one may obtain that the Bayes decision under the 
Lkl loss is given for y > 1 by: 

yPv{y) 



iV(y-i)" 

The decision for y = denoted A(0), is: 

A(0) = argmin / — ~ ^ e~ x dG(\) 

a J A 

Je- x dG(X) 
J X~ l e~ x dG(X) ' 

In particular, A(0) = if G gives a positive probability to any neighborhood of 
0. 

The decision for y > 1 may be estimated as in (4) together with the three 
adjustments suggested in Section 2, along the same lines. However, we still need 
to approximate the Bayes decision A(0). Note however, that if G has a point 
mass at 0, however small, the risk will be infinite unless A(0) = 0. This is the 
only safe decision, since the cannot ascertain that there is no mass at 0. 

Note, defining Z = Y + Af, M ~ Po(h) under the KL loss as in Step 1 in 
the squared loss, might introduce instability due to small values of Pz(z — 1) in 
the denominator of Pz(z)/Pz(z — 1), e.g., for z = min{^i, Z n }. One might 
want to define the "corrupted" variable alternatively, as Z ~ B(Y,p). Then 
Z ~ Po(pX), when Y ~ Po(X). Our smoothing/corrupting parameter is p. We 
skip the details of the analogouse of steps 1-3. 

Throughout the rest of the paper, we consider and evaluate procedures ex- 
plicitly only under the Li loss. 
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5. Simulations 

In this section we provide some simulation results which approximate the risk 
of various procedures as defined in (1). Specifically for various fixed vectors 
A = (Ai, A„), we estimate E\ J2 &h(Yi) - Xi) 2 and E x J2 Anm( y i) ~ Xi) 2 , for 
various values of h. Our approach is of compound decision and our procedures 
are permutation invariant, it is known ( sec, e.g., Greenshtein and Ritov (2009) 
) that a good benchmark and lower a bound for the risk of our suggested proce- 
dures is nB(\); here B(X) is the Bayes risk for the problem where we observe 
A ~ G, where G is the empirical distribution which is defined by Ai,...,A„. 
The main findings are the following. As already seen in Example 1, adjusting 
the classical non parametric empirical Bayes yields a significant improvement 
in the risk. The modification of the normal empirical Bayes works well (for a 
suitable choice of h), in the Poisson case. It works surprisingly well even for 
configurations with small Ai, as may be seen in Table 2. 

The class A^ of adjusted Poisson EB procedures has advantage over the class 
Ajv./i of modified normal EB procedures when applied to configurations with 
homogeneous Aj's, see Tables 3 and 4. There are a few reasons for that. First, 
aggressive smoothing is helpful in a homogeneous setup. Thanks to the bias 
correction in Adjustment 1 and the Rao-Blackwcllization step of Adjustment 
2, the adjusted Poisson class performs relatively well when the smoothing is 
aggressive, i.e., h is large. Another reason is that in a homogeneous setup, much 
of the risk is due to large and moderate deviations, and the need to correctly 
assess those few deviations. The normal approximation might be misleading for 
moderate deviations and thus an advantage to the class of adjusted Poisson 
procedures might be expected. Another advantage of the adjusted Poisson EB 
procedures is its little sensitivity to the choice of h, compared to the modi- 
fied normal procedures. This small sensitivity is evident in the simulations and 
should be understood better theoretically. Adjustment 2 seems like a crucial 
stabilizer. Also, the simulations indicate the advantage of the choice q = \ over 
the choice q = 0. 

We elaborate on Table 1. The reading of the other tables is similar. In Table 1 
we compare the different estimators when Ai, i = 1, . . . , 200, are evenly spaced 
between 5 and 15. The table is based on 1000 simulations. We present the 
risk of A/j and Ajv,/i for various values of h. In practice h should be selected 
by cross-validation or another method, we elaborate on it in Section 6. The 
normal procedures are based on variance stabilizing transformation with both 
q = and q = \- Monotonicity is imposed on all the estimators, as described 
in Step 3, through the Iso- Regression R-procedure . The risk that corresponds 
to Ao is the risk of the classical Poisson empirical Bayes procedure ( i.e., no 
smoothing through convolution) on which monotonicity is imposed. For the 
configuration studied in Table 1, the risk of the naive procedures equals the sum 
of the variances which equals 200 * 7.5 = 1500. A good proxy of the empirical 
distribution which is defined by (Ai, A200) is U(5, 15). The Bayes risk for the 
case where A~f7(,5,15) was computed numerically and it equals approximately 
4.4, hence we have nB(\) w 200 x 4.4 = 880. In each line the number in boldface 
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Table 1 

Different EB procedures for Ai, . . . , A200 that are evenly spaced between 5 and 15 





h 





0.2 


0.4 


0.8 


1.8 


3 




risk 


1114 


1049 


1017 


994 


965 


958 




h 


0.2 


0.3 


0.5 


0.7 


0.9 


1.2 


q = 


risk 


1263 


1131 


1043 


1022 


1063 


1197 


1=\ 


risk 


1230 


1099 


1013 


997 


1046 


1138 



Table 2 

Different EB procedures for \± , . . . , A200 that are evenly spaced between and 5 



A h 


h 





0.2 


1 


1.8 


2.4 


3 




risk 


248 


232 


232 


242 


249 


258 




h 


0.2 


0.3 


0.5 


0.8 


1.0 


1.4 


q = 


risk 


324 


291 


268 


267 


280 


317 


1=\ 


risk 


308 


267 


245 


242 


254 


291 



corresponds to the minimal risk. 

The model studied in Table 2 is of A.;, i = 1, . . . , 200 evenly spaced between 
and 5. Comparing the two halves of the table, one may see how well the normal 
modification works even for such small value of Aj. 

Next, in Table 3, we study the case where Ai = • • • = A200 = 10. Here the 
advantage of the adjusted Poisson over the modified normal is clear. 

Next we study the following a situation where we have a few large Ai values: 
Ai = • • • = A200 = 5, while A201 = • • ■ = A220 = 15- There is still a clear 
advantage of the adjusted Poisson over he modified normal. See Table 4 

Finally we investigate a configuration with only n = 30 observations spread 



Table 3 

Different EB procedures for Ai = • • ■ = A200 = 10. 





h 





0.2 


0,1 


1 


2 


3 




risk 


253 


121 


90 


54 


38 


28 




h 


0.2 


0.3 


0.5 


0.7 


0.9 


1.3 


q = 


risk 


369 


231 


208 


290 


455 


822 


9 = 1 


risk 


330 


197 


180 


265 


442 


808 
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Table 4 

Different EB procedures for Xi = • • • = A200 = 5, while A201 = ■ ■ • = A220 = 15. 





h 





0.2 


0. 1 


1.2 


2.0 


3 




risk 


665 


476 


471 


449 


462 


483 


Ajv.h 


h 


0.2 


0.3 


0.5 


0.9 


1.1 


1.4 


g = 


risk 


857 


687 


580 


696 


766 


854 


* = \ 


risk 


819 


613 


550 


653 


732 


823 



Table 5 

Different EB procedures for Ai , . . . , A30 that are evenly spread between and 20. 





h 





0.2 


0.4 


1.2 


2.0 


3 




risk 


867 


256 


249 


256 


262 


260 


AiV.h 


h 


0.2 


0.3 


0.5 


0.9 


1.2 


1.4 


g = 


risk 


316 


303 


278 


241 


241 


239 




risk 


316 


302 


280 


243 


236 


239 



over a larger interval. The A; arc evenly spread between and 20. For this config- 
uration there is a slight advantage of the modified normal procedure. In order to 
demonstrate the discontinuity of A/,, mentioned in Remark 1, we approximated 
the the risk of A^ for h = 0.01, based on 1000 simulations. The approximated 
risk is 244, compared to 867, for h = 0, this is also the minimal approximated 
risk from the values of h that we tried in Table 5. Note, that 867, the average 
simulated loss that corresponds to h = 0, is much larger than the risk of the 
naive procedure which estimate A^ by Yi\ the risk of the later is the sum of the 
variances which equals 10 x 30 = 300. 

Remark on sparsity: Our procedure is less efficient in situations with ex- 
treme " sparsity" . Here by " sparsity" we mean that the vast majority of the 
values of \ are equal to a certain known value Ao and very few others are very 
different from Ao - In such a cases thresholding procedures might perform better, 
i.e., "round" to Ao the estimators of A^ for Yi which are not too far from Ao- 

6. Choosing the smoothing-parameter by Cross-validation 

In practice we need to choose h in order to apply our adjusted Poisson method. 
In this section we will suggest a slightly non-standard way for cross validation. 
It is explained in the Poisson context, and then in the normal context. The 
same general idea works for other cases where an observation may be factorized, 
e.g., for infinitely divisible experiments. About factorization of experiments, see 
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Greenshtein (1996) and references there. 

A standard situation in cross validation analysis is the following. The sample 
is composed of pairs Zi = (Xi,Yi), i = l,...,n, where Xi is considered as 
an explanatory variable, and Yi is the dependent variable. We have a class 
of predictors which depends on a parameter: {8h(-\ Z\, . . . , Z n ) : h e H}. 
Often, every value of h represents a different trade-off between variance and 
bias. The problem is how to select a good value of h. One approach is based 
on setting aside a test sub-sample, Z m +x, . . . , Z n , say. We can construct the 
predictors Th{-, Z\, . . . , Z m ), h S H, and use the test-bed sample for validation, 

e.g., compute S(h) = SIL m +i f^PQ; Z x , . . . , Z m ) - Y^j . On the face of it, 
the Poisson sample we consider does not have this structure. There are no 
explanatory and dependent variables, just one observation, and the observations 
are not i.i.d., at least not conditionally on Ai, . . . , A n . Yet, we can separate the 
sample to two independent samples. 

Let p G (0, 1), p w 1, and let U\, ...,U n be independent given Yi, . . . , Y„, 
Ui ~ B(Yi,p), i = 1, . . . , n. As is well known, one of the features of the Poisson 
distribution is that Ui ~ Po{p\), and V\ = Yi — Ui ~ Po((l — p)K), and they 
are independent given Ai, . . . , A„. We will use the main sub-sample U\, . . . , U n 
for the construction of the family of estimators (parameterized by h) , while the 
auxiliary sub sample, Vi, . . . , V n for validation. Let h € H be a family of 

estimators, based on U\, . . . ,U n such that S^(Ui) estimates p\i, i = l,...,n. 
Consider: 

V(h;U,V) 

1 71 9 

= -E(to)-p(i-j')- 1 ^) 
»=i 

= \ zZ(fiWi) P Xi) P (i p)- 1 ^ - (i - p)\ l )) 2 

i=l 
1 ™ 



where A„ is a random quantity that does not depend on h, and has no impor- 
tance to the selection of h, while 



= n _^ J2(^(Ui)-pXi)(Vi - (l-p)Xi). (13) 



____ V 

Since Vi , . . . , Ki are independent and independent of t/i , . . . , U n given Ai , 



E(Rl(h)\U,\) = {1 t P p)n 2 zZ(^(Ui)- P Xi) 2 Xi. (14) 



We conclude that if (1 — p)n/ max{Xi}\H\ — > oo, then 

V(h;U,V)=L(6tp\)+o p (l), (15) 
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uniformly in h £ H. Recall that the decision function 6^ used in the above 
result, is the non-parametric empirical Bayes procedure based on U\, . . . ,U n 
and S^iUi) is estimating p\i. If also p — > 1, we suggest to use the value h that 
minimizes V(h; U, V), to construct a similar estimator based on the original 
sample Y\, . . . , Y n , estimating Ai, . . . , A„. 

V(h; U, V), given the sample Y\, ... ,Y n is a randomized estimator of the loss 
function. Once again we suggest in this paper to replace a randomized estimator 

by its expectation given the sample E^V(h; U ', V) Y^j . This expectation can be 
estimated by a Monte Carlo integration — sampling K i.i.d. samples of U and 
V. 

For the normal model, Z± ~ N(fj,i,l), i = 1, ...,n, let ~ -/V(0, 1) be 
auxiliary i.i.d. variables, independent of Y\, . . . ,Y n . Define E7, = Yi + aej, = 
Yi — (l/a)e,-. Then t/j and V* are independent both with mean /itj, and with 
variances 1 + a 2 and 1 + (1/a 2 ) correspondingly. Again, [/ may be used for 
estimation and V for validation, where a > 0, a — > 0. 

Example 2: Consider the configuration Ai = • • • = A200 = 10, simulated in 
Table 3 Section 5. In that table h = 3 is recommended with a noticeable ad- 
vantage over h < 0.4. We applied the above cross validation procedure with 
p = 0.9 on a single realization of Yi, i = 1, . . . ,200. We repeated the cross- 
validation process K = 10000 times on this single realization for the values 
h e {0,0.5,1,1.5,2,2.5,3}. The corresponding numbers (scaled by (1 — p) 2 ) 
were: 165.834, 164.862, 164.736, 164.457, 164.421, 164.286, 164.340. Note that, 
the last numbers represent mainly the variance of our validation variable, but 
the success of the corresponding estimator is also a factor. The numbers indi- 
cate that the choices h = 0, 0.5, 1 are inferior, the formal recommended choice 
is h = 2.5, the second best is h = 3. 

We repeated the simulation on another single realization, again K = 10000, 
this time we took p = 0.85. The corresponding numbers are: 220.562, 217.986, 
217.706, 217.374, 217.209, 217.272, 217.247. Again, the numbers indicate that 
the choices h = 0, 0.5, 1 are inferior. The formal recommended choice is h = 2, 
the second best is again h = 3. 



7. Real Data Example. 

In the following we study an example based on real data about car accidents 
with injuries in 109 towns in Israel in July 2008. The 109 towns are those that 
had at least one accident with injuries in that period of time, in the following we 
ignore this selection bias. There were 5 Tuesdays, Wednesdays and Thursdays, 
in that month. For Town i, let Yi be the total number of accidents with injuries 
in those 5 Wednesdays. Similarly, for Town i, let Zi be the average number of 
accidents with injuries in the corresponding Tuesdays and Thursdays. We mod- 
elled Yi as independent distributed Po(A,). In the following we will report on the 
performance of our cmpirical-Bayes estimator for various smoothing parameters 
h. It is evaluated through the 'empirical risk': 

^(A,(y,)-^) 2 - 
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Table 6 

EB applied to traffic accident by city 





h 





0.5 


1 


1.5 


2 


3 




Risk 


140 


163 


172 


168 


166 


159 




h 


0.2 


0.6 


1 


2 


3 


4 




Risk 


262 


185 


174 


170 


183 


202 



The towns Tel- Aviv and Jerusalem had a heavy impact on the risk and thus we 
excluded them from the analysis. The remaining data seems to have relatively 
low values of Ai , a case where the classical Poisson-EB procedure is expected to 
perform well, and indeed it is. The range of Yi is 0-14, while X) ^ = 135, and 
^2 Y? = 805. In this example, the classical Poisson-EB adjusted for monotonicity 
(i.e., h = 0), gave the best result. Applying a smoothing parameter ft, > is 
slightly inferior based on the above empirical risk. Yet, it is re-assuring to see 
how stable is the performance of A/ l; as h varies. The empirical loss for the 
naive procedure estimating A^ by Yi, is 240. The modified normal estimators 
with q = j and various values of h was applied to the data as well. Again 
a clear advantage of our class of adjusted Poisson procedures over the class 
of modified normal procedures was observed. In particular, the former class is 
much more stable with respect to the choice of the smoothing parameter h. The 
results are summarized in Table 6. 



8. The nonparametric MLE 

The nonparametric maximum-likelihood (NPMLE) is an alternative approach 
for estimating S . It yields, automatically, a monotone and smooth decision 
function. See Jian and Zhang (2009) for the normal model. To simplify the 
discussion, we will assume that Ai, . . . , A n arc i.i.d. random variables sampled 
from the distribution G. 

Note that the NPMLE maximizes with respect to G, the likelihood function: 

1 n 

i=0 
oo 

£(F„(i-l) -#„(»)) IogPG(i) 
i=Q 

logp G (0) + £#„(!) log* 



n 

i=l i=0 
oo 



i=0 



logp G (0) + f ™« M G M + C(y). 



i=0 
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where P„ is the empirical process, P„(i) = P„({i}), and F„(i) = ^n(j) 
(F„(-l) = l)._ 

Suppose G is supported on [a, 6]. Extend 



5 G M= JXve-^dGiX) ' 



Then, clearly, S G (y) £ [a, b]. Moreover, it is monotone non-decreasing with 
derivative <5 (y) = cov(A,logA) € [0,&log& — a log a] (the covariance is with 
respect to measure X v e~ x dG(X) normalized) 

It is well known that the NPMLE is discrete with point mass g±, . . . , g/. on 
Ai, . . . , Afc say. It is easy to see that it satisfies 

" X Vi 

Since the left hand side is a polynomial in A of degree max yi , and a polynomial 
of degree q in A can be equal to exp{ A} only q times, we conclude that k < max yi 
(a more careful argument can reduce the bound on the support size). Hence, it 
is feasible to approximate algorithmically the NPMLE. Pursuing the asymptotic 
properties of the NPMLE estimator of 6 is beyond the scope of this paper and 
we intend to do it somewhere else. 
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